2023-03-23

Overview

  • Why bother?
  • What is Data Science?
  • What are Data Scientists?
  • What is Machine Learning?
  • What types of problems can Machine Learning solve?
  • AI ≠ Machine Learning
  • Bias in AI

Why bother?

Google searches for "Machine Learning"

[source: trends.google.com]

What is Data Science?

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured.

[Source: Wikipedia]

[further sources: gartner.com, fscj.edu, bdbizviz.com, wikipedia.org, carestruck.org]

What are Data Scientists?

Stats on Data Scientists

[source: www.oreilly.com/data/free/2017-data-science-salary-survey.csp]

What industries do Data Scientists work in?

How do Data Scientists spend their time?

What tasks do Data Scientists work on?

What education do Data Scientists have?

What do Data Scientists earn?

What is Machine Learning (ML)?

Machine Learning – Definition

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

[Tom M. Mitchell]

→ Template to describe complex problems with less ambiguity:

  • Experience E = Data required
  • Task T = Problem (class) to solve
  • Measure P = Metric to evaluate results

Example – Detect spam emails

  • Experience E = Data set of emails with examples of spam and ham
  • Task T = Classify an email as either spam or ham
  • Measure P = Accuracy, i.e. the share of emails classified correctly

→ Preparing a decision-making program to solve this task is called training

→ Collected email examples are called the training set

→ The program is referred to as a model
(as in a model of the problem of classifying spam from non-spam)
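
The E/T/P template can be sketched in a few lines of Python. The emails, the word-count scoring rule and all function names below are invented for illustration; this is a toy, not a recommended spam filter.

```python
from collections import Counter

# Experience E: a tiny, made-up training set of (text, label) pairs.
training_set = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

def train(examples):
    """Training: count how often each word occurs in spam and in ham."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def classify(model, text):
    """Task T: label an email by the class in which its words were seen more often."""
    words = text.split()
    spam_score = sum(model["spam"][w] for w in words)
    ham_score = sum(model["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

def accuracy(model, examples):
    """Measure P: share of emails classified correctly."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)

model = train(training_set)
test_set = [("win cheap money", "spam"), ("meeting tomorrow", "ham")]
print(accuracy(model, test_set))
```

Measuring P on emails that were not part of the training set shows whether the model generalises rather than memorises.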

Types of Machine Learning

Machine Learning tasks T are typically classified into two broad categories:

Supervised learning:

The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.

Unsupervised learning:

No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

AI ≠ Machine Learning?

AI is the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.

[Poole, Mackworth, Goebel (1998). Computational Intelligence: A Logical Approach]

Machine learning is a subset of artificial intelligence that is concerned with the construction and study of systems that can learn from data.

What types of problems can ML solve?

  • Classification
  • Regression
  • Association rules
  • Clustering
  • Recommending

Classification

  • Given a set of records, \(X = \{x_1 , \dots , x_n \}\)
  • Each record \(x_i = \{x_{i_1} , \dots, x_{i_m} \}\) is a set of \(m\) attributes
  • Each record has an additional attribute \(l_i \in L\), with \(L\) being a finite set of class labels
  • Find a function \(f\) such that \(f (x_i) \approx l_i\)
  • Task: Compute \(l_j\) for previously unseen records \(x_j\) as accurately as possible
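
One of many possible choices for \(f\) is a nearest-neighbour rule: an unseen record receives the label of the closest known record. The sketch below assumes numeric attributes and Euclidean distance; the data and the function name are made up for illustration.

```python
import math

def nearest_neighbour_classifier(records, labels):
    """Return f, where f(x) is the label of the training record closest to x."""
    def f(x):
        distances = [math.dist(x, x_i) for x_i in records]
        return labels[distances.index(min(distances))]
    return f

# X: records with m = 2 numeric attributes; L = {"a", "b"} (invented values).
X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
labels = ["a", "a", "b", "b"]

f = nearest_neighbour_classifier(X, labels)
print(f((2.0, 1.0)))  # label for a previously unseen record x_j
print(f((8.5, 9.0)))
```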

Classification – Direct marketing example

  • Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a product
  • Approach:
    • Use historic data of the same or a similar product
    • We know which customers decided (not) to buy
    • {buy, don’t buy} are labels to learn
    • Collect various demographic, lifestyle, transaction information about customers (type of business, where they live, how much they earn, etc.)
    • Use this information as input attributes to learn a classifier model

Classification – Fraud detection

  • Goal: Predict fraudulent credit card transactions
  • Approach:
    • Use credit card transactions and information on the account holder as attributes (when a customer buys, what they buy, how often they pay on time, etc.)
    • Label past transactions as fraud or fair; this is the class label to learn
    • Learn a model for the class of the transactions
    • Use this model to detect fraud by observing credit card transactions on an account

Classification – Customer churn

  • Goal: Predict whether a customer is likely to be lost
  • Approach:
    • Use transaction records to capture customer behaviour
      (attributes around recency and frequency of service usage and transaction volumes)
    • Enrich transactions with demographics and customer specific data
    • Define "churn" and label customers
    • Find a model for churners
    • Start predicting

Regression

  • Given a set of records, \(X = \{x_1 , \dots , x_n \}\)
  • Each record \(x_i = \{x_{i_1} , \dots, x_{i_m} \}\) is a set of \(m\) attributes
  • Each record has an additional attribute \(y_i \in \mathbb{R}\) (the prediction target)
  • Find a function \(f\) such that \(f (x_i) \approx y_i\)
  • Task: Compute \(y_j\) for previously unseen records \(x_j\) as accurately as possible
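
For the simplest case of a single attribute (\(m = 1\)), \(f\) can be a straight line fitted by ordinary least squares. The sketch below assumes that setting; the data points and the name `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one attribute: f(x) = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Known records x_i with targets y_i (made-up numbers).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

f = fit_line(xs, ys)
print(f(5.0))  # prediction y_j for a previously unseen record x_j
```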

Regression – Predicting house prices

  • Goal: Predict house price
  • Approach:
    • Collect property data (characteristics, year built, location, school zones, sales season)
    • Link property with sales history
    • Train a model for the sales price
    • Use model to predict sales prices of unseen houses
  • Commercial examples:

Regression – Modeling salaries

[source: www.oreilly.com/data/free/2017-data-science-salary-survey.csp]

Association rules

  • Ideas come from market basket analysis
  • What products are frequently bought together?
  • How should shelves be managed?
  • What impact does the discontinuation of a product have on the sales of other products?

Goal: Find frequent/interesting patterns, associations, correlations among sets of items in a transactional database

Association Rules – Marketing & sales promotion

  • Let the rule discovered be:
    {Bagels} → {Potato chips}
  • Potato chips as consequent
    → Determine what should be done to boost its sales
  • Bagels in the antecedent
    → Which products would be affected if the store discontinues selling bagels
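
How strong a rule such as {Bagels} → {Potato chips} is can be quantified with support and confidence, two standard measures that these slides do not name explicitly. The sketch below computes both on an invented list of transactions.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Invented point-of-sale data: one set of items per transaction.
transactions = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "butter"},
    {"milk", "potato chips"},
    {"milk", "butter"},
]

print(support(transactions, {"bagels", "potato chips"}))
print(confidence(transactions, {"bagels"}, {"potato chips"}))
```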

Association Rules – Shelf management

  • Goal: Identify items that are bought together by many customers
  • Approach: Process the collected point-of-sale data to find groups of items frequently bought together
  • Classic rule in literature:
    If a customer buys diapers and milk, then they are very likely to buy beer

Clustering

  • Task: Given a set of data points and a similarity measure among them, find clusters such that
    • Data points in same cluster are more similar to one another
    • Data points in separate clusters are less similar to one another
  • Similarity Measures:
    • Euclidean distance (continuous attributes)
    • Problem-specific measures (sounds)
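
A common way to find such clusters with Euclidean distance is k-means (Lloyd's algorithm), which the slides do not name; it is used here only as one possible example. The sketch seeds the centres with the first k points to stay deterministic, and the data is invented.

```python
import math

def k_means(points, k, iterations=10):
    """Lloyd's algorithm with Euclidean distance; the first k points seed the centres."""
    centres = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        # Move each centre to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centres

points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
assignment, centres = k_means(points, k=2)
print(assignment)
```

The number of clusters k has to be chosen up front, and different values give different but equally valid groupings.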

Clustering – Principle

Example: Assume we are given a set of records. We could group the records into 4 different clusters… or into 5 clusters.

Clustering – Market segmentation

  • Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target
  • Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information
    • Find clusters of "similar" customers
    • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters

Recommending

  • Estimate a utility function that predicts how a user will like an item
  • Estimation based on:
    past behaviour, relations to other users, item similarity, context

[source: towardsdatascience.com]

Recommending – Online retail products

  • Goal: Suggest products to the user of an online retail store to increase sales
  • Approach:
    • Identify similar users based on their past purchase behaviour
      → recommend their purchases
    • Recommend similar products that a user has purchased in the past
    • If purchase history not available, recommend best-selling products
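
The approach above can be sketched as follows. The Jaccard similarity between purchase sets is an assumption made for this example, as are the user names, products and the function name `recommend`.

```python
from collections import Counter

def recommend(purchases, user, top_n=2):
    """purchases maps each user to the set of products they have bought."""
    own = purchases.get(user, set())
    if not own:
        # No purchase history: fall back to the overall best sellers.
        counts = Counter(item for items in purchases.values() for item in items)
        return [item for item, _ in counts.most_common(top_n)]
    # Score products the user does not own by the similarity of the users who bought them.
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        similarity = len(own & items) / len(own | items)  # Jaccard similarity
        for item in items - own:
            scores[item] += similarity
    return [item for item, score in scores.most_common(top_n) if score > 0]

# Invented purchase histories.
purchases = {
    "ann": {"tent", "stove"},
    "bob": {"tent", "stove", "lamp"},
    "cat": {"novel", "lamp"},
    "dan": set(),
}

print(recommend(purchases, "ann"))  # driven by the most similar user
print(recommend(purchases, "dan"))  # no history, so best sellers
```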

Data Science Process

Lots of confusion about what ML is…

[source: xkcd.com]

Real life examples of biased AI applications

Health Care

  • In 2019 predictive analytics used in US hospitals was shown to favor white over black patients
  • Race was not a variable in the algorithm, but other variables, notably healthcare cost history, acted as a proxy
  • The rationale was that cost summarizes how many healthcare needs a particular person has
  • Since black patients often incurred lower healthcare costs than white patients, the algorithm favored white patients

Scientific American Article

Real life examples of biased AI applications

Correctional Offender Management

  • US court systems use algorithms to predict the likelihood of a defendant becoming a recidivist
  • It was shown that the model produced roughly twice as many false positives for recidivism for black offenders (45%) as for white offenders (23%)

ProPublica Article
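
A per-group false positive rate like the one quoted above can be computed as sketched below. The records are invented toy data, not the ProPublica figures, and the function name is hypothetical.

```python
def false_positive_rate(records):
    """records: (predicted_high_risk, reoffended) pairs for one group."""
    # Only people who did NOT reoffend can be false positives.
    predictions_for_negatives = [pred for pred, reoffended in records if not reoffended]
    return sum(predictions_for_negatives) / len(predictions_for_negatives)

# Invented toy data for two groups.
group_a = [(True, False), (True, False), (False, False), (False, False), (True, True)]
group_b = [(True, False), (False, False), (False, False), (False, False), (True, True)]

print(false_positive_rate(group_a))
print(false_positive_rate(group_b))
```

Comparing this rate across groups is one simple form of post hoc bias assessment: a large gap means one group is wrongly flagged as high risk more often than the other.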

Real life examples of biased AI applications

Amazon Recruiting

  • Algorithms used to mechanize the search for top talent
  • Data used for model building was heavily skewed towards male resumes, which led the algorithm to favor male applicants

The Guardian Article

Bias in AI applications

What can we do to reduce bias in AI?

  • Recognition of impact of bias in AI modelling
  • Data that one uses needs to represent “what should be” and not “what is”
  • Pay more attention to identifying potential bias, using tools that examine datasets and suggest corrections
  • Post hoc assessment of bias in the algorithm’s output
  • Consumers of AI predictions (e.g. businesses) need an adequate understanding of model limitations and of how to interpret predictions properly

  • Good article – Bias and ethical considerations in machine learning and the automation of risk assessment

Questions?